ABSTRACT


In many critical applications of digital systems, fault tolerance has been an essential
architectural attribute for achieving high reliability. In recent years, the concept of the
performability of such systems has drawn the attention of many researchers. In this paper,
we develop a general Markov model for fault tolerant computer systems. Various
important performance measures, including the performability measures as well as some
new performance measures, are treated in a unified manner. Furthermore, general and
efficient computational procedures are developed for calculating these performance
measures based on the uniformization technique of Keilson (1974, 1979). A numerical example
is given to illustrate the computational procedures developed.
§0 Introduction


In many critical applications of digital systems such as flight control, nuclear plant
control, etc., the need for achieving high reliability has made fault tolerance an essential
architectural attribute of digital systems. In general, to meet high reliability requirements,
redundancy techniques are employed in which systems contain multiple copies of a resource.
Typical forms of redundant structures for fault-tolerant systems are categorized into four
classes and combinations thereof, see Beaudry (1978). In Massive Redundant Systems,
redundancy techniques such as triple-modular redundancy (Von Neumann (1956)), N-modular
redundancy (Mathur and Avizienis (1970)), and self-purging redundancy (Losq (1976)) are
employed, where the same task is executed on each equivalent module and a vote on the outputs
is taken to improve the output information. In Standby Redundant Systems, tasks are executed
on active units in the system. When a failure of an active unit is detected, the system
attempts to replace the faulty unit with a spare unit, see Bouricius, Carter, Jessep and
Schneider (1969). Hybrid Redundant Systems consist of massive redundant cores with spares to
replace failed modules, see Losq (1976). In Gracefully Degrading Systems, all operative units
in the system are kept active for executing tasks. Upon the detection of a unit failure
the system attempts to reconfigure the remaining operative units and continue operation,
see Borgerson and Freitas (1975). The reader is referred to an excellent paper by
Avizienis (1978) for a more thorough discussion of the concept of fault tolerance in digital
systems and a chronological view of the evolution of fault-tolerant systems.

A substantial literature exists on developing and analyzing reliability models of these
fault-tolerant digital systems, see e.g. Arnold (1973), Beaudry (1978), Borgerson and
Freitas (1975), Bouricius, Carter, Jessep and Schneider (1969), Castillo and Siewiorek
(1981), Costes, Landrault and Laprie (1978), Gay and Ketelson (1979), Huslende (1981),
Iyer, Donatiello and Heidelberger (1984), Koren and Sue (1979), Krishna and Shin (1983),
Makam and Avizienis (1981, 1982), Mathur and Avizienis (1970), Meyer (1980, 1982),
Meyer, Furchtgott and Wu (1980), Ng and Avizienis (1977, 1980), Oda, Tohma and
Furuya (1981), Osaki and Nishino (1980), Seth and Lipsky (1983), Sonerio and Suk
(1980), Trivedi (1982) and others. Validation of models for such highly reliable systems has
also been discussed, see e.g. Trivedi, Gault and Clery (1980). No general computational
schemes, however, have been developed for evaluating important reliability measures.

The purpose of this paper is threefold. First, a general stochastic model for fault-tolerant
computer systems is developed. Underlying distributions of interest can be system state
dependent, incorporating possible interdependency among multiple modules.
The distributions are not restricted to exponential distributions. The model is general
in that any Markov chain model in the literature can be viewed as a special case,
including the unified reliability model of Ng and Avizienis (1980), and it provides substantial
modeling flexibility for the performance analysis of such systems. Second, various important
performance measures (some of them new) are treated in a unified manner. In
particular, several performability measures are discussed concerning the computational
capacity of the system in the time interval [0,t). Finally, efficient and general computational
procedures are developed for evaluating all of these performance measures, using
the uniformization technique of Keilson (1974, 1979).

To the best of the author's knowledge, the concept of the performability of the system can
be traced back to the early 1960s. The distribution and the moments of functionals of Markov
renewal processes were studied by Jewell (1963) and McLean and Neuts (1967). Relevant
weak and strong laws were also examined by Pyke and Schaufele (1964). In an excellent
and substantial paper by Çinlar (1969), functionals of semi-Markov processes were examined,
where the transform of the time dependent distributions of such functionals and
recursion formulae for the moments thereof were established. Keilson and Rao (1970, 1971)
studied the limiting behavior as t → ∞ of processes defined on a Markov chain with
state dependent growth rates.

Recently the concept of the performability of the system has been revitalized in
the context of fault tolerant computer systems, see Meyer (1980, 1982). Iyer, Donatiello
and Heidelberger (1984) developed recursion formulae in a Markov chain context using
the spectral representation. These recent papers did not cite the relevant earlier work
described above. The results of Çinlar (1969) were derived based
on renewal type arguments. In this paper, we provide an independent and totally analytic
proof for the Markov chain case. An extension of this analytic proof to the semi-Markov case
is straightforward.

In Section 1, a general stochastic model will be developed for fault tolerant computer
systems. The model enables one to incorporate time dependent availability, reliability
and performability measures of such systems. These performance measures will be
classified, in Section 2, into three categories: (A) availability and reliability measures
independent of computational capacity; (B) performability measures involving cumulative
computational capacity; (C) performability measures involving computational capacity
during the first passage time to the system failure. Most of the performance measures in
category (A) are traditional. The performance measures in category (B) are concerned
with the time dependent performability of the system, which has drawn the attention
of researchers during the last few years. One of the performance measures in category
(C) was first introduced and analyzed by Beaudry (1978). We will extend this work in a
more systematic manner. Sections 3 through 5 will be devoted to developing general
computational schemes for evaluating the performance measures in categories (A) through
(C) respectively. Finally, in Section 6, a numerical example will be given, illustrating the
efficiency of the computational procedures developed.


It should be noted that the underlying distributions are not restricted to exponential
distributions. Furthermore, the N modules are not necessarily mutually independent. For
example, suppose that the up-time of module i is an Erlang-2 random variable.
Upon a failure, an exponentially distributed random duration elapses before the repair
starts. The repair time itself is also exponentially distributed. One then sets $S_i =
\{0, 1, 2, 3\}$ where 0 represents the failed state under repair; 1 represents the failed state
with no repair; 2 represents the operative state under the second phase of the up-time;
and 3 represents the operative state under the first phase of the up-time. Moreover, the
parameter values involved could depend on the states of other modules. Hence the model
could incorporate the Phase-type distributions of Neuts (1981) with possible interdependence
among modules. Thus the model considered here is quite general and provides substantial
modeling flexibility for the performance analysis of fault tolerant computer systems.
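
As an illustration of the single-module example above, the following is a minimal sketch of how the four states of $S_i$ might be encoded as a transition rate matrix. The rate names `lam` (rate of each Erlang-2 up-time phase), `delta` (rate of the waiting time before repair starts) and `mu` (repair rate) are hypothetical, as is the assumption that both Erlang phases share the same rate; none of these symbols appear in the text.

```python
def module_rate_matrix(lam, delta, mu):
    """Transition rates nu[m][n] for one module on S_i = {0, 1, 2, 3}.

    3: operative, first up-time phase    2: operative, second up-time phase
    1: failed, waiting for repair        0: failed, under repair
    The rates lam, delta and mu are illustrative assumptions.
    """
    nu = [[0.0] * 4 for _ in range(4)]
    nu[3][2] = lam      # first Erlang phase completes
    nu[2][1] = lam      # second Erlang phase completes: the module fails
    nu[1][0] = delta    # repair starts after an exponential delay
    nu[0][3] = mu       # repair completes: the module is as good as new
    return nu


def generator(nu):
    """Infinitesimal generator: off-diagonal rates, minus the exit rate on the diagonal."""
    q = [row[:] for row in nu]
    for m, row in enumerate(nu):
        q[m][m] = -sum(row)   # the diagonal of nu is zero, so this is the exit rate
    return q


nu = module_rate_matrix(lam=2.0, delta=5.0, mu=1.0)
Q = generator(nu)
```

Making `lam`, `delta` or `mu` functions of the states of the other modules would be one way to express the interdependence mentioned above.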

The computational capacity of the whole system can be characterized by a mapping
$\phi : S \to R_+$. The value $\phi(J(t))$ represents the maximum amount of computation per
unit time that the system can provide at time t. We decompose the state space S into
two subsets G and B where

(1.5)  $G = \{ m : m \in S,\ m_{N+1} = 1 \}$  and  $B = S \setminus G$.

The subset G is called a good set since $\phi(m) > 0$ for any $m \in G$. Similarly the subset B
is called a bad set since $\phi(m) = 0$ for any $m \in B$.

In some applications, it may be of interest to study the computational capacity of the
system for specific jobs. We assume that a set of jobs, C, to be processed by the system
consists of M different classes $C_j$, $1 \le j \le M$. As before, the computational capacity of
the system for the jobs in $C_j$ is characterized by a mapping $\phi_j : S \to R_+$. It should be
noted, however, that the modules required for processing jobs may vary depending on classes.
Hence it is possible to have $\phi_j(m) = 0$ for some $m \in G$. Accordingly we also define a
good set $G_j$ and a bad set $B_j$ for each class $C_j$, i.e.,

(1.6)  $G_j = \{ m : m \in S,\ \phi_j(m) > 0 \}$  and  $B_j = S \setminus G_j$.

We note that $G_j \subset G$ and $B \subset B_j$ for all j.
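
The inclusions noted after (1.6) can be checked on a toy case. The sketch below assumes a hypothetical two-module system whose state is the pair $(m_1, m_2)$ with 1 meaning "up"; for simplicity it takes G to be the set where the total capacity is positive rather than using the component $m_{N+1}$ of (1.5). The capacity functions `phi` and `phi_1` are illustrative choices, not taken from the text.

```python
from itertools import product


def good_set(states, capacity):
    """States with strictly positive capacity, in the spirit of (1.6)."""
    return {m for m in states if capacity(m) > 0}


# Hypothetical two-module system: m = (m_1, m_2), 1 = module up, 0 = module down.
states = set(product((0, 1), repeat=2))


def phi(m):
    """Total capacity: one unit of computation per working module (assumption)."""
    return m[0] + m[1]


def phi_1(m):
    """Capacity for class C_1 jobs, which are assumed to need module 1."""
    return m[0]


G = good_set(states, phi)
G_1 = good_set(states, phi_1)
B = states - G
B_1 = states - G_1
```

In this example the state (0, 1) belongs to G but not to $G_1$: the system is up, yet it offers no capacity to class $C_1$ jobs.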

The decomposition of the state space into subsets G and B together with the computational
capacity function $\phi$ enables one to describe the system behavior more accurately.
In particular, it is often assumed in the modeling of multiprocessor digital systems that
the failed state of the system due to the failure in coverage and the failed state of the
system due to the failure of all modules are the same, see e.g. the state 0 in Figure 7
of Beaudry (1978). In our model, this distinction can be made clearly, allowing one to
introduce different distributions for recovery times. This point will be illustrated through
a numerical example in Section 6.

The transition rate matrix $\boldsymbol{\nu}$ of (1.4) depends heavily on the system structure. By
specifying $\boldsymbol{\nu}$, any Markov chain model appearing in the literature can be viewed as a
special case of this model. In the next section, we introduce various performance measures
concerning availability, reliability and performability of the system for a general
transition rate matrix $\boldsymbol{\nu}$ and general computational capacity functions $\phi$ and $\phi_j$.
General computational schemes for calculating these performance measures will also be
developed in the following sections.
§2 Performance Measures

Several different traditional and performance related reliability measures have been
considered for fault tolerant computer systems in the literature. These performance
measures are often used to compare alternative configurations for fault tolerance. In this
section, we present both time dependent and stationary performance measures concerning
availability, reliability and performability of the general model developed in Section 1.
This provides a concise summary of key performance measures discussed in the literature.
Furthermore, some important new performance measures are also introduced. The
performance measures we consider in this section are classified into three categories.

(A) Availability and reliability measures independent of computational capacity

These  measures  are  intended  to  provide  information  about  the  state  of  the  system 
and  have  no  relevance  to  computational  capacity  of  the  system  at  different  system  states. 

(A1) State probability

Let $a^T$ be the initial state probability vector for the multivariate process $J(t)$, i.e.,

(2.1)  $a^T = (a_m)_{m \in S}$ ;  $a_m = P[J(0) = m]$,  $m \in S$.

The state probability vector at time t given $a^T$ is clearly of interest. We denote this
vector by

(2.2)  $p^T(t|a) = (p_m(t|a))_{m \in S}$ ;  $p_m(t|a) = P[J(t|a) = m]$,  $m \in S$,

where $J(t|a)$ denotes the state of the system at time t given the initial state probability
vector $a^T$. When there is no confusion, we will write $p^T(t)$ instead of $p^T(t|a)$. When the
transition rate matrix $\boldsymbol{\nu}$ is irreducible, the ergodic state probability vector exists where

(2.3)  $e^T = \lim_{t \to \infty} p^T(t|a)$.

(A2) Point and interval availability

The point availability $A(t|a)$ at time t is the probability that the system is operational
at time t given $a^T$. The interval availability $AI(t,\tau|a)$ at time t is the expected fraction
of the interval $(t, t+\tau)$ during which the system is operational. Formally we define

(2.4)  $A(t|a) = P[J(t|a) \in G]$

and

(2.5)  $AI(t,\tau|a) = \frac{1}{\tau} E\left[ \int_t^{t+\tau} I(x|a)\,dx \right]$

where $I(t|a)$ is the indicator function of (1.1) given a. At ergodicity, one has

(2.6)  $A_\infty = \lim_{t \to \infty} A(t|a) = \lim_{t \to \infty} AI(t,\tau|a) = AI_\infty$.

(A3) Time to first system failure and related reliability

Suppose that $J(0) = \theta$ where $\theta \in G$. Of interest is the time until the first system
failure. This is the first passage time of the multivariate process $J(t)$ from $\theta \in G$ to the
bad states B, defined by

(2.7)  $T_{\theta B} = \inf\{ t : J(t) \in B \mid J(0) = \theta \in G \}$.

If $J(t)$ has an initial state probability vector $a^T$, then the corresponding first passage
time $T_{aB}$ is a probability mixture of the $T_{\theta B}$ weighted by $a_\theta$, where $T_{\theta B} = 0$ with
probability one for $\theta \in B$. Typically the system is operative at time t = 0 and one has
$a_\theta = 0$ for $\theta \in B$. Otherwise $T_{aB}$ has mass $\sum_{\theta \in B} a_\theta$ at the origin.

Of interest is the cumulative distribution function of $T_{aB}$ defined by

(2.8)  $F_{aB}(x) = P[T_{aB} \le x]$.

An important reliability measure is the probability that the system will continue to be
operative for a period longer than x given a. We denote this reliability measure by
$R_a(x)$. One then has

(2.9)  $R_a(x) = P[T_{aB} > x] = 1 - F_{aB}(x)$.

Also of interest are the moments

(2.10)  $E[T_{aB}^k]$,  k = 1, 2, ...,

and the α-reliable mission time $\tau_\alpha$ defined by

(2.11)  $\tau_\alpha = \sup\{ x : R_a(x) \ge \alpha R_a(0+) \}$.

We note that the first moment $E[T_{aB}]$ is the mean time to failure.
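
The reliability $R_a(x)$ of (2.9) can be illustrated numerically. The following is a rough sketch, not the procedure of Section 3.3 (which lies outside this excerpt): it applies a uniformization series to the chain restricted to the good set G, so that every path entering B is discarded. The function name, its arguments and the two-unit example are assumptions made for illustration.

```python
import math


def reliability(a_G, nu_GG, exit_rates, x, eps=1e-12, max_terms=100000):
    """Approximate R_a(x) = P[T_aB > x] by a truncated uniformization series.

    a_G        -- initial probabilities of the good states
    nu_GG      -- transition rates among good states (zero diagonal)
    exit_rates -- total exit rate of each good state, including rates into B
    """
    n = len(a_G)
    nu = max(exit_rates)                       # uniformization constant
    # Sub-stochastic matrix of the uniformized chain restricted to G;
    # the missing row mass is the per-step probability of entering B.
    A = [[nu_GG[m][j] / nu + ((1.0 - exit_rates[m] / nu) if m == j else 0.0)
          for j in range(n)] for m in range(n)]
    v = list(a_G)                              # holds a_G^T A^k
    weight = math.exp(-nu * x)                 # Poisson weight for k = 0
    total, mass = 0.0, 0.0
    for k in range(1, max_terms + 1):
        total += weight * sum(v)               # still in G after k - 1 steps
        mass += weight
        if mass >= 1.0 - eps:                  # remaining Poisson mass is below eps
            break
        weight *= nu * x / k
        v = [sum(v[m] * A[m][j] for m in range(n)) for j in range(n)]
    return total


# Hypothetical example: two parallel units without repair, unit failure rate 1.
# Good states: index 0 = both units up, index 1 = one unit up.
lam = 1.0
r = reliability([1.0, 0.0], [[0.0, 2 * lam], [0.0, 0.0]], [2 * lam, lam], x=1.0)
```

For this example the closed form $R(x) = 2e^{-\lambda x} - e^{-2\lambda x}$ is available and can be used to check the series.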

(A4) Time to next system failure at time t and interval reliability

Given an initial state probability vector $a^T$, let $T_{aB|t}$ be the time to the next system
failure from time t if the system is up at time t and zero otherwise. We define

(2.12)  $F_{aB|t}(x) = P[T_{aB|t} \le x]$.

We note that $T_{aB|t}$ has mass $F_{aB|t}(0+) = 1 - A(t|a)$ at the origin. Of interest is an
interval reliability $RI_{a|t}(x)$ given by

(2.13)  $RI_{a|t}(x) = P[T_{aB|t} > x] = 1 - F_{aB|t}(x)$.

$RI_{a|t}(x)$ is the probability that, given a, the system will continue to operate until time
t + x from time t. Corresponding to (2.10) and (2.11), we are interested in the moments

(2.14)  $E[T_{aB|t}^k]$,  k = 1, 2, ...,

and the α-reliable interval mission time

(2.15)  $\tau_{\alpha|t} = \sup\{ x : RI_{a|t}(x) \ge \alpha RI_{a|t}(0+) \}$.

In some applications, one may be interested in the conditional random variable $T_{aB|t}$
given that $T_{aB|t} > 0$. We denote this conditional random variable by $T^+_{aB|t}$. Reliability
measures in (2.12) through (2.15) can be modified accordingly in a straightforward
manner. For example, one has, corresponding to (2.13),

(2.16)  $RI^+_{a|t}(x) = P[T^+_{aB|t} > x] = RI_{a|t}(x) / A(t|a)$.

Here $RI^+_{a|t}(x)$ is the probability that the system will continue to be operative until time
t + x given that it is operative at time t, with the initial probability vector a.

(A5) Stationary reliability measures

When the transition rate matrix $\boldsymbol{\nu}$ is irreducible, the multivariate Markov process
$J(t)$ is ergodic. Hence the random variable $T_{aB|t}$ converges in distribution to a random
variable, say S, as t → ∞. The limiting random variable S denotes the time until the next
failure at ergodicity. We see that

(2.17)  $F_S(x) = P[S \le x] = \lim_{t \to \infty} F_{aB|t}(x)$.

With probability $F_S(0+) = 1 - A_\infty$ the system is not functioning at ergodicity. Hence S
has mass $1 - A_\infty$ at the origin. Stationary reliability measures corresponding to (2.13)
through (2.15) can be found by letting t → ∞. In particular we define the stationary
interval reliability $RI_S(x)$ by

(2.18)  $RI_S(x) = P[S > x] = \lim_{t \to \infty} RI_{a|t}(x)$

and the α-reliable stationary mission time by

(2.19)  $\tau^S_\alpha = \sup\{ x : RI_S(x) \ge \alpha RI_S(0+) \}$.


The conditional random variable $T^+_{aB|t}$ also converges in distribution to $S^+ = S|_{S>0}$
as t → ∞. Stationary reliability measures for $S^+$ can be obtained similarly. One has, for
example,

(2.20)  $RI_{S^+}(x) = P[S^+ > x] = RI_S(x) / A_\infty$.

(A6) Quasi-stationary reliability measures

Suppose that the system has been operating for a "long time". One then wishes
to know how long it will take from the current time until the first system failure. As for
the stationary case, the conditional random variable $T_{aB|t}$ given that $J(0) \in G$ and
$T_{aB} > t$ converges in distribution to a random variable, say Q, as t → ∞. This limiting
random variable Q is called the quasi-stationary exit time from G, see e.g. Keilson (1974,
1979). More formally we define

(2.21)  $F_Q(x) = P[Q \le x] = \lim_{t \to \infty} P[T_{aB|t} \le x \mid T_{aB} > t,\ J(0) \in G]$.

Quasi-stationary reliability measures can then be introduced in terms of $F_Q(x)$. In
particular we define the quasi-stationary interval reliability $RI_Q(x)$ by

(2.22)  $RI_Q(x) = P[Q > x] = 1 - F_Q(x)$

and the α-reliable quasi-stationary mission time $\tau^Q_\alpha$ by

(2.23)  $\tau^Q_\alpha = \sup\{ x : RI_Q(x) \ge \alpha \}$.

(A7) Cumulative operational time during the interval [0, t)

Another important reliability measure of interest is the cumulative operational time
of the system during the time period [0, t) given the initial state probability vector $a^T$.
We denote this random variable by $CO(t|a)$. One then has

(2.24)  $CO(t|a) = \int_0^t I(x|a)\,dx$.

Evaluation of the distribution of $CO(t|a)$ is quite hard. We will derive the expression of
the moments

(2.25)  $E[CO(t|a)^k]$,  k = 1, 2.

An associated computational procedure will also be developed. It should be noted that
$E[CO(t|a)] = t\,AI(0,t|a)$.

One may also consider several other compound reliability measures such as the joint
measure for the number of system failures during [0,t) and the system state at time t,
see e.g. Baxter (1982), Masuda, Shanthikumar and Sumita (1984), Shanthikumar (1983)
and Sumita and Shanthikumar (1984). It should be noted that the reliability measures
described in (A5) and (A6) have not been discussed in the context of fault tolerant
computer systems.

(B) Performance Measures Involving Cumulative Computational Capacity

When the computational capacity of the system in an operational state is constant,
independent of the actual state of the system (as in the case of identical standby redundant
systems), all the performance measures described in Section A2 can be directly related
to computational capacity measures. However, this is not the case in every system. As
described in Section 0, the gracefully degrading system reacts to a detected failure by
reconfiguring the system modules, which leads to a new system state, possibly with a
decreased level of performance. The performance measures to be discussed in this section
are concerned with the computational capacity of the system in a finite time interval.


(B1) Cumulative computational capacity in the time interval [0,t)

Let $V(t|a)$ be the cumulative computational capacity of the system for the whole class
C of jobs in the interval [0, t) given a. More formally we define

(2.26)  $V(t|a) = \int_0^t \phi(J(x|a))\,dx$

where $\phi : S \to R_+$ is the computational capacity function introduced in Section 1 and
$J(t|a)$ denotes the state of the system at time t given a. As mentioned earlier, the modules
required for processing may vary depending on classes of jobs. Hence it may be desirable
to study the computational capacity of the system for $C_j$ jobs. We denote this random
variable by $V_j(t|a)$ where

(2.27)  $V_j(t|a) = \int_0^t \phi_j(J(x|a))\,dx$.
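
To make definition (2.26) concrete, the sketch below draws one sample of $V(t|a)$ by simulating a path of the chain and accumulating capacity times sojourn time. This is only an illustrative Monte Carlo sketch under assumed inputs (a rate matrix `nu` with zero diagonal, a capacity list `phi`, a fixed starting state); it is not the analytic procedure the paper develops in Section 4.

```python
import random


def sample_cumulative_capacity(nu, phi, start, t, rng):
    """One sample of V(t) = integral over [0, t) of phi(J(x)) dx along a simulated path.

    nu    -- transition rates nu[m][n] with zero diagonal (assumed input format)
    phi   -- phi[m] is the computational capacity in state m
    start -- initial state index
    rng   -- a random.Random instance
    """
    state, clock, total = start, 0.0, 0.0
    while True:
        rate = sum(nu[state])                     # total exit rate of the current state
        if rate == 0.0:                           # absorbing state: remain there until t
            return total + phi[state] * (t - clock)
        sojourn = rng.expovariate(rate)
        if clock + sojourn >= t:                  # the horizon ends during this sojourn
            return total + phi[state] * (t - clock)
        total += phi[state] * sojourn
        clock += sojourn
        # Jump to the next state with probability proportional to its rate.
        state = rng.choices(range(len(nu)), weights=nu[state])[0]


# Hypothetical two-state example: state 0 is up (capacity 2), state 1 is down.
rng = random.Random(0)
nu = [[0.0, 1.0], [3.0, 0.0]]
samples = [sample_cumulative_capacity(nu, [2.0, 0.0], 0, 5.0, rng) for _ in range(1000)]
estimate = sum(samples) / len(samples)            # Monte Carlo estimate of E[V(5)]
```

Taking `phi` to be the indicator of the good set turns the same routine into a sampler for the cumulative operational time $CO(t|a)$ of (2.24).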

(B2) Cumulative computational capacity in the time interval [t, t+τ)

Of related interest is the cumulative computational capacity of the system in the time
interval [t, t+τ), t > 0. Following the notation of (B1) we define

(2.28)  $V(t,\tau|a) = \int_t^{t+\tau} \phi(J(x|a))\,dx$

and

(2.29)  $V_j(t,\tau|a) = \int_t^{t+\tau} \phi_j(J(x|a))\,dx$.

Çinlar (1969) established the transform $E[e^{-wV(t|a)}]$ explicitly and provided a recursion
formula for calculating the moments of $V(t|a)$ in the semi-Markov context. The
computational scheme for the moments of $V(t|a)$ has been developed in a recent paper
by Iyer, Donatiello and Heidelberger (1984) using the spectral representation of the
underlying Markov chain. In Section 4, we will provide an independent derivation of the double
transform $\int_0^\infty e^{-st} E[e^{-wV(t|a)}]\,dt$. Numerical procedures for calculating the first two
moments of the performability measures in (2.26) through (2.29) will also be developed.

(C) Performance Measures Involving Computational Capacity during the
First Passage Time to System Failure

Performance-related reliability measures involving computational capacity of a computer
system during the first passage time to system failure were first studied in Beaudry
(1978). In what follows, we describe these together with several other related reliability
measures. We discuss only the total computational capacity of the system. The
computational capacity for jobs in the class $C_j$ can be studied in a similar manner, where
the mapping $\phi : S \to R_+$ should be replaced by $\phi_j : S \to R_+$. For future reference,
we indicate this by adding the index j to the expressions for the total computational
capacity.

(C1) Computational capacity before the first system failure and computational
reliability

Given a, let $W_{|a}$ be the computational capacity available from the system before the
first system failure. More formally, one has

(2.30)  $W_{|a} = \int_0^{T_{aB}} \phi(J(x|a))\,dx$.

We denote the distribution function of $W_{|a}$ by

(2.31)  $F_{W|a}(x) = P[W_{|a} \le x]$.

Suppose that a task requiring x units of computational time is initiated at time t = 0.
Then the probability that this task will be computed without any interruptions due to
system failure is given by $P[W_{|a} > x] = 1 - F_{W|a}(x)$. We call this measure a computational
reliability denoted by $R_{W|a}(x)$, i.e.

(2.32)  $R_{W|a}(x) = 1 - F_{W|a}(x)$.

Of related interest are the moments

(2.33)  $E[W_{|a}^k]$,  k = 1, 2, ...

and the α-reliable task length defined by

(2.34)  $\ell_\alpha = \sup\{ x : R_{W|a}(x) \ge \alpha \}$.

We note that $\ell_\alpha$ is the maximum computational length of a task that has a probability
of α or more of being completed before the first system failure.
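
The first moment in (2.33) admits a simple linear-system characterization by first-step analysis: writing $y_m$ for the expected capacity delivered before failure when starting in good state m, one has $\nu_m y_m = \phi(m) + \sum_{n \in G} \nu_{mn} y_n$. The sketch below solves this system directly. It is offered as an illustration under stated assumptions (the function names, the input layout and the coverage parameter `c` in the example are all hypothetical), not as the computational procedure of Section 5.

```python
def solve(M, b):
    """Solve M y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (aug[r][n] - tail) / aug[r][r]
    return y


def expected_work_before_failure(a_G, nu_GG, exit_rates, phi_G):
    """E[W] = sum over good states m of a_m * y_m, where
    exit_rates[m] * y_m - sum_n nu_GG[m][n] * y_n = phi_G[m]."""
    n = len(a_G)
    minus_Q = [[(exit_rates[m] if m == j else 0.0) - nu_GG[m][j] for j in range(n)]
               for m in range(n)]
    y = solve(minus_Q, phi_G)
    return sum(a_G[m] * y[m] for m in range(n))


# Hypothetical gracefully degrading pair of processors, unit failure rate 1,
# no repair, coverage c: a failure in the two-up state is survived with probability c.
c = 0.9
nu_GG = [[0.0, 2.0 * c], [0.0, 0.0]]     # index 0 = two processors up, 1 = one up
exit_rates = [2.0, 1.0]                  # includes the uncovered rate 2(1 - c) into B
work = expected_work_before_failure([1.0, 0.0], nu_GG, exit_rates, [2.0, 1.0])
mttf = expected_work_before_failure([1.0, 0.0], nu_GG, exit_rates, [1.0, 1.0])
```

Setting every capacity to one, as in the last line, recovers the mean time to failure $E[T_{aB}]$ of (A3), which shows how the category (C) measures generalize the category (A) ones.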

(C2) Time dependent computational capacity until next system failure and
interval computational reliability

Let $W_{|a,t}$ be the total computational capacity available from the system from time
t until the next system failure if the system is operative at time t, and zero otherwise.
That is,

(2.36)  $W_{|a,t} = \int_0^{T_{p(t|a)B}} \phi(J(x|p(t|a)))\,dx$.

The corresponding cumulative distribution function is denoted by

(2.37)  $F_{W|a,t}(x) = P[W_{|a,t} \le x]$.

The interval computational reliability $RI_{W|a,t}(x)$ is then defined by

(2.38)  $RI_{W|a,t}(x) = P[W_{|a,t} > x] = 1 - F_{W|a,t}(x)$.

$RI_{W|a,t}(x)$ is the probability that the system is operative at time t and will successfully
complete a task of computational length x before its next failure. The counterparts of
(2.33) and (2.34) for $W_{|a,t}$ are

(2.39)  $E[W_{|a,t}^k]$,  k = 1, 2, ...

and

(2.40)  $\ell_{\alpha|t} = \sup\{ x : RI_{W|a,t}(x) \ge \alpha RI_{W|a,t}(0+) \}$.

Here $\ell_{\alpha|t}$ is the α-reliable interval task length, representing the maximum computational
length of a task that will be completed before the next system failure with probability
α or more, if it is initiated at time t. We note that $W_{|a,t}$ has mass $P[J(t|a) \in B]$ at
the origin. The corresponding measures associated with the conditional random variable
$W^+_{|a,t} = W_{|a,t}|_{W_{|a,t} > 0}$ can be studied following the argument in (A4).

(C3) Stationary computational measures

When the system is ergodic, the random variable $W_{|a,t}$ converges in distribution to
a random variable, say $S_W$, as t → ∞. The limiting random variable $S_W$ denotes the
computational capacity of the system until the next system failure at ergodicity. One
sees that

(2.41)  $F_{S_W}(x) = P[S_W \le x] = \lim_{t \to \infty} F_{W|a,t}(x)$.

We note that $S_W$ has mass $1 - A_\infty$ at the origin, i.e. $F_{S_W}(0+) = 1 - A_\infty$. Stationary
computational measures corresponding to (2.38) to (2.40) can be found by letting t → ∞.
We define

(2.42)  $RI_{S_W}(x) = P[S_W > x] = 1 - F_{S_W}(x)$

and

(2.43)  $\ell^S_\alpha = \sup\{ x : RI_{S_W}(x) \ge \alpha RI_{S_W}(0+) \}$.

The conditional measures associated with $S_W^+ = S_W|_{S_W > 0}$ can be discussed similarly.

(C4) Quasi-stationary computational measures

The quasi-stationary reliability measures associated with the quasi-stationary exit
time Q were discussed in (A6). In this section we examine the quasi-stationary
computational measures. Suppose that the system has been operating for a long time. The
question to be answered is how large the computational capacity of the system would be
before the next system failure. Formally the random variable $Q_W$ denoting the above
quantity can be defined by the limiting distribution of $W_{|a,t}$ given that $J(0) \in G$ and
$T_{aB} > t$ as t → ∞. That is,

(2.44)  $F_{Q_W}(x) = P[Q_W \le x] = \lim_{t \to \infty} P[W_{|a,t} \le x \mid T_{aB} > t,\ J(0) \in G]$.

Quasi-stationary computational measures can then be introduced through $F_{Q_W}(x)$. In
particular, we define the quasi-stationary interval computational reliability $RI_{Q_W}(x)$ by

(2.45)  $RI_{Q_W}(x) = P[Q_W > x] = 1 - F_{Q_W}(x)$

and the α-reliable quasi-stationary task length by

(2.46)  $\ell^Q_\alpha = \sup\{ x : RI_{Q_W}(x) \ge \alpha \}$.

The computational capacity measures described in (C3) and (C4) have not been
discussed in the literature.

§3 Numerical Procedures for Computing Performance Measures in Category (A)

As we will see, all performance measures described in Section 2(A) can be expressed in
terms of the time dependent state probability vector $p^T(t)$ of the underlying Markov chain
$J(t)$, possibly with certain modifications. Hence for the computation of these performance
measures, it is necessary to develop efficient numerical procedures to evaluate $p^T(t|a)$. In
the next subsection, we show that the uniformization procedure of Keilson (1974, 1979)
provides the computational vehicle needed for this purpose.

3.1 State probability

We have assumed that the underlying process $J(t)$ is a finite Markov chain in continuous
time on S governed by the transition rate matrix $\boldsymbol{\nu} = [\nu_{mn}]$. Let $\mathbf{p}(t) = [p_{mn}(t)]$ be
the transition probability matrix of $J(t)$, that is

(3.1)  $p_{mn}(t) = P[J(t) = n \mid J(0) = m]$,  $m, n \in S$.

Let

(3.2)  $\nu_m = \sum_{n \in S} \nu_{mn}$.

Since the cardinality of S, denoted by |S|, is finite, there exists a positive $\nu$ such that

(3.3)  $\sup_{m \in S} \nu_m \le \nu$.

A Markov chain in continuous time is said to be uniformizable if its governing transition
rates satisfy (3.3). All finite Markov chains in continuous time are automatically
uniformizable. Keilson (1974, 1979) has shown that uniformizability provides a useful
bridge between continuous time Markov chains and discrete time Markov chains in the
following manner.
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For  notational  convenience,  we  define  the  diagonal  matrix  =D  whose  diagonal  elements 
are  Um  s  ordered  appropriately.  From  the  Kolmogorov  forward  equations,  one  then  has 

(3.4)  g{t)  =  e=  ;  £  =  -gD  +  g- 

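Here $\mathbf{Q}$ is the infinitesimal generator. Using the uniformization constant $\nu$ defined in (3.3), we define

(3.5)  $\mathbf{a}_\nu = \mathbf{I} - \frac{1}{\nu}\boldsymbol{\nu}_D + \frac{1}{\nu}\boldsymbol{\nu}$

where $\mathbf{I}$ is an identity matrix of size $|S|$. We note that $\mathbf{a}_\nu \ge 0$ and $\mathbf{a}_\nu \mathbf{1} = \mathbf{1}$ where $\mathbf{1}$ is the vector of length $|S|$ having all elements equal to 1. Hence $\mathbf{a}_\nu$ is a stochastic matrix. From (3.4) one sees that $\mathbf{Q} = -\nu[\mathbf{I} - \mathbf{a}_\nu]$ so that

(3.6)  $\mathbf{p}(t) = e^{-\nu t[\mathbf{I} - \mathbf{a}_\nu]} = \sum_{k=0}^{\infty} q_k(t)\,\mathbf{a}_\nu^k$

where

(3.7)  $q_k(t) = e^{-\nu t}\frac{(\nu t)^k}{k!}, \quad k = 0, 1, 2, \ldots.$

Here $\mathbf{a}_\nu^0 = \mathbf{I}$. Equation (3.6) provides a bridge between a Markov chain in discrete time governed by $\mathbf{a}_\nu$ and the Markov chain in continuous time governed by $\boldsymbol{\nu}$. Hence given an initial state probability vector $\mathbf{a}$, one has

(3.8)  $\mathbf{p}^T(t|\mathbf{a}) = \sum_{k=0}^{\infty} q_k(t)\,\mathbf{a}^T\mathbf{a}_\nu^k.$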

Equation (3.8) enables one to calculate $\mathbf{p}^T(t|\mathbf{a})$ efficiently via computer. Although the expression involves an infinite series, the matrix norm of $\mathbf{a}_\nu^k$ is bounded by one for all $k$ and the truncation point $k^*$ may be determined for a given accuracy $\epsilon$ and sufficiently large $T > 0$ by

(3.9)  $k^* = \min\{k : \sum_{j=0}^{k} q_j(t) > 1 - \epsilon, \ 0 < t < T\}.$

The ergodic probability vector $\mathbf{e}^T$ can be found by letting $t \to \infty$ in (3.8). Alternatively $\mathbf{e}^T$ may be found by solving $\mathbf{e}^T\mathbf{a}_\nu = \mathbf{e}^T$ and $\mathbf{e}^T\mathbf{1} = 1$.
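
The series (3.8) with the truncation rule (3.9) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes the rate matrix is supplied as a list of lists with zero diagonal, and the two-state chain, the function names and the value of `nu` are hypothetical. Since the Poisson mass on $\{0, \ldots, k\}$ decreases in $t$, the sketch evaluates the truncation point at the largest time of interest.

```python
import math

def uniformize(rates, nu):
    """Return a_nu = I - nu_D/nu + rates/nu as in (3.5); rates[m][n] is the rate m -> n."""
    size = len(rates)
    a_nu = [[rates[m][n] / nu for n in range(size)] for m in range(size)]
    for m in range(size):
        a_nu[m][m] += 1.0 - sum(rates[m]) / nu   # assumes rates[m][m] == 0
    return a_nu

def vec_mat(v, mat):
    """Row vector times matrix."""
    return [sum(v[i] * mat[i][j] for i in range(len(v))) for j in range(len(mat[0]))]

def truncation_point(nu, horizon, eps):
    """Smallest k whose Poisson(nu * horizon) mass on {0..k} exceeds 1 - eps, cf. (3.9)."""
    term = math.exp(-nu * horizon)
    total, k = term, 0
    while total <= 1.0 - eps:
        k += 1
        term *= nu * horizon / k
        total += term
    return k

def transient(a0, rates, nu, t, eps=1e-10):
    """Truncated series (3.8): sum over k <= k* of q_k(t) a^T a_nu^k."""
    a_nu = uniformize(rates, nu)
    k_star = truncation_point(nu, t, eps)
    weight = math.exp(-nu * t)            # q_0(t)
    v = list(a0)                          # a^T a_nu^0
    p = [weight * x for x in v]
    for k in range(1, k_star + 1):
        v = vec_mat(v, a_nu)              # a^T a_nu^k
        weight *= nu * t / k              # q_k(t) obtained from q_{k-1}(t)
        p = [pi + weight * x for pi, x in zip(p, v)]
    return p

# Hypothetical two-state chain: 0 -> 1 at rate 2, 1 -> 0 at rate 3.
RATES = [[0.0, 2.0], [3.0, 0.0]]
p = transient([1.0, 0.0], RATES, nu=4.0, t=0.7)
```

For this toy chain the closed form $p_{00}(t) = 0.6 + 0.4e^{-5t}$ is available, which gives a simple check on the truncated series. For very large $\nu T$ the initial term $e^{-\nu T}$ underflows, so a production code would need a scaled recursion.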

3.2  Point  and  interval  availability 

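The point availability $A(t|\mathbf{a})$ defined in (2.4) can be expressed straightforwardly in terms of $\mathbf{p}^T(t|\mathbf{a})$. We denote the subvector of a vector $\mathbf{b}$ of length $|S|$ restricted to the good set $G$ by $\mathbf{b}_G$. One then sees that

(3.10)  $A(t|\mathbf{a}) = \mathbf{p}_G^T(t|\mathbf{a})\,\mathbf{1}_G.$

For the interval availability $AI(t,\tau|\mathbf{a}) = \frac{1}{\tau}E[\int_t^{t+\tau} I(x|\mathbf{a})\,dx]$ in (2.5), one first observes from the linearity of $E[\cdot]$ and the boundedness of the integral that $AI(t,\tau|\mathbf{a}) = \frac{1}{\tau}\int_t^{t+\tau} E[I(x|\mathbf{a})]\,dx$. Since $E[I(x|\mathbf{a})] = P[J(x|\mathbf{a}) \in G]$, one has

(3.11)  $AI(t,\tau|\mathbf{a}) = \frac{1}{\tau}\int_t^{t+\tau} \mathbf{p}_G^T(x|\mathbf{a})\,\mathbf{1}_G\,dx.$

The integral $\int_t^{t+\tau} \mathbf{p}^T(x|\mathbf{a})\,dx$ can be computed from (3.8) as

(3.12)  $\int_t^{t+\tau} \mathbf{p}^T(x|\mathbf{a})\,dx = \sum_{k=0}^{\infty} Q_k(t,\tau)\,\mathbf{a}^T\mathbf{a}_\nu^k$

where

(3.13)  $Q_{k+1}(t,\tau) = \frac{1}{\nu}\left(q_{k+1}(t) - q_{k+1}(t+\tau)\right) + Q_k(t,\tau), \quad k = 0, 1, 2, \ldots,$

starting with $Q_0(t,\tau) = \frac{1}{\nu}e^{-\nu t}(1 - e^{-\nu\tau})$.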
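
The recursion (3.13) lends itself to a direct implementation. The sketch below assumes that $Q_k(t,\tau)$ stands for $\int_t^{t+\tau} q_k(x)\,dx$, which is what (3.12) requires, and uses a fixed truncation `k_max` rather than an accuracy-driven one. The function names and the two-state example are hypothetical.

```python
import math

def uniformize(rates, nu):
    """a_nu of (3.5) for a rate matrix with zero diagonal."""
    size = len(rates)
    a_nu = [[rates[m][n] / nu for n in range(size)] for m in range(size)]
    for m in range(size):
        a_nu[m][m] += 1.0 - sum(rates[m]) / nu
    return a_nu

def vec_mat(v, mat):
    return [sum(v[i] * mat[i][j] for i in range(len(v))) for j in range(len(mat[0]))]

def interval_availability(a0, rates, good, nu, t, tau, k_max=200):
    """AI(t, tau | a) from (3.11)-(3.13), truncating the series at k_max terms."""
    a_nu = uniformize(rates, nu)
    q_t = math.exp(-nu * t)                 # q_0(t)
    q_tt = math.exp(-nu * (t + tau))        # q_0(t + tau)
    big_q = (q_t - q_tt) / nu               # Q_0(t, tau)
    v = list(a0)
    total = big_q * sum(v[m] for m in good)
    for k in range(1, k_max + 1):
        v = vec_mat(v, a_nu)                # a^T a_nu^k
        q_t *= nu * t / k                   # q_k(t)
        q_tt *= nu * (t + tau) / k          # q_k(t + tau)
        big_q += (q_t - q_tt) / nu          # recursion (3.13)
        total += big_q * sum(v[m] for m in good)
    return total / tau

# Hypothetical chain: state 0 is good, state 1 is bad.
RATES = [[0.0, 2.0], [3.0, 0.0]]
ai = interval_availability([1.0, 0.0], RATES, good=[0], nu=4.0, t=1.0, tau=1.5)
```

With $p_{00}(x) = 0.6 + 0.4e^{-5x}$ for this chain, the result can be compared against the integral evaluated in closed form.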

3.3  Time  to  first  system  failure  and  related  reliability 
To find the distribution of the first passage time $T_{aB}$, we consider the absorbing process $J^*(t)$ obtained from the original process $J(t)$ by censoring transitions from $B$ to $G$, see Keilson (1979). It is easily seen that the infinitesimal generator $\mathbf{Q}^*_{GG}$ governing $J^*(t)$ inside the set $G$ is given by

(3.14)  $\mathbf{Q}^*_{GG} = -\boldsymbol{\nu}_{D:GG} + \boldsymbol{\nu}_{GG}$

where $\mathbf{b}_{GG}$ denotes the submatrix of an $|S| \times |S|$ matrix $\mathbf{b}$ restricted to $G$. Correspondingly, the transition probability matrix $\mathbf{p}^*_{GG}(t)$ of $J^*(t)$ is given by

(3.15)  $\mathbf{p}^*_{GG}(t) = \exp\{\mathbf{Q}^*_{GG}\,t\} = \sum_{k=0}^{\infty} q_k(t)\,\mathbf{a}_{\nu:GG}^k.$

k=0 

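Since the reliability $R_a(x)$ of (2.9) is given by $R_a(x) = P[T_{aB} > x] = P[J^*(x|\mathbf{a}) \in G]$, one has

(3.16)  $R_a(x) = \sum_{k=0}^{\infty} q_k(x)\,\mathbf{a}_G^T\,\mathbf{a}_{\nu:GG}^k\,\mathbf{1}_G$

and

(3.17)  $F_{aB}(x) = P[T_{aB} \le x] = 1 - R_a(x).$

We note that if $\mathbf{a}_G^T\mathbf{1}_G < 1$, then $T_{aB}$ has mass $(1 - \mathbf{a}_G^T\mathbf{1}_G)$ at the origin. For the moments of $T_{aB}$, one has (see Keilson (1974, 1979) or Neuts (1981))

(3.18)  $E[T_{aB}^k] = \frac{k!}{\nu^k}\,\mathbf{a}_G^T\,\mathbf{Z}_{GG}^k\,\mathbf{1}_G$

where $\mathbf{Z}_{GG}$ is the fundamental matrix defined by

(3.19)  $\mathbf{Z}_{GG} = [\mathbf{I}_{GG} - \mathbf{a}_{\nu:GG}]^{-1}.$

The $\alpha$-reliable mission time $\tau_\alpha$ of (2.11) can be found from (3.16).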
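
As an illustration of (3.16) and of (3.18) with $k = 1$, the sketch below restricts the uniformized matrix to the good set and sums the resulting series. It replaces the explicit inverse in (3.19) by the Neumann series $\sum_k \mathbf{a}_{\nu:GG}^k$, which is a substitution made here for brevity and is valid when absorption into $B$ is certain. The three-state tandem example and all function names are assumptions.

```python
import math

def vec_mat(v, mat):
    return [sum(v[i] * mat[i][j] for i in range(len(v))) for j in range(len(mat[0]))]

def sub_uniformized(rates, good, nu):
    """a_{nu:GG}: the uniformized matrix restricted to G (rows lose the mass sent to B)."""
    a_gg = []
    for m in good:
        row = [rates[m][n] / nu for n in good]
        row[good.index(m)] += 1.0 - sum(rates[m]) / nu   # full exit rate nu_m
        a_gg.append(row)
    return a_gg

def reliability(a0, rates, good, nu, x, k_max=200):
    """R_a(x) of (3.16), truncated at k_max terms."""
    a_gg = sub_uniformized(rates, good, nu)
    v = [a0[m] for m in good]
    weight = math.exp(-nu * x)
    r = weight * sum(v)
    for k in range(1, k_max + 1):
        v = vec_mat(v, a_gg)
        weight *= nu * x / k
        r += weight * sum(v)
    return r

def mean_time_to_failure(a0, rates, good, nu, tol=1e-12):
    """(1/nu) a_G^T Z_GG 1_G with Z_GG summed as a Neumann series."""
    a_gg = sub_uniformized(rates, good, nu)
    v = [a0[m] for m in good]
    total = 0.0
    while sum(v) > tol:
        total += sum(v)
        v = vec_mat(v, a_gg)
    return total / nu

# Hypothetical tandem system: 0 -> 1 -> 2 at unit rates, with B = {2}.
RATES = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
r = reliability([1.0, 0.0, 0.0], RATES, good=[0, 1], nu=2.0, x=1.5)
mttf = mean_time_to_failure([1.0, 0.0, 0.0], RATES, good=[0, 1], nu=2.0)
```

In this example $T_{aB}$ is Erlang distributed with two unit-rate phases, so $R_a(x) = e^{-x}(1 + x)$ and $E[T_{aB}] = 2$.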

3.4  Time  to  next  system  failure  at  time  t  and  interval  reliability 

For given $\mathbf{a}$, the random variable $T_{aB|t}$ was defined in 2.(A4) as the time until the next system failure from time $t$ if the system is operative at time $t$, and zero otherwise. Because of the Markov property of the underlying process $J(t)$, this random variable is equal in distribution to the ordinary first passage time having the initial state probability vector $\mathbf{p}^T(t|\mathbf{a})$, i.e.

(3.20)  $T_{aB|t} \stackrel{d}{=} T_{p(t|a)B}.$

Hence performance measures (2.12) through (2.15) involving $T_{aB|t}$ can be readily obtained from (3.16) through (3.18) where $\mathbf{a}_G^T$ should be replaced by $\mathbf{p}_G^T(t|\mathbf{a})$. As noted in 2.(A4), $T_{aB|t}$ has mass $1 - A(t|\mathbf{a})$ at the origin. The conditional performance measures involving $T^+_{aB|t} = T_{aB|t}$ given $T_{aB|t} > 0$ can be found accordingly.
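
The substitution described here can be sketched by chaining two Poisson-weighted series: one for $\mathbf{p}^T(t|\mathbf{a})$ on the full state space and one for the survival function started from its restriction to $G$. The helper `poisson_mix`, the fixed truncation `k_max` and the two-state example are assumptions made for illustration.

```python
import math

def vec_mat(v, mat):
    return [sum(v[i] * mat[i][j] for i in range(len(v))) for j in range(len(mat[0]))]

def uniformize(rates, nu):
    size = len(rates)
    a_nu = [[rates[m][n] / nu for n in range(size)] for m in range(size)]
    for m in range(size):
        a_nu[m][m] += 1.0 - sum(rates[m]) / nu
    return a_nu

def poisson_mix(v, mat, nu, t, k_max=200):
    """sum_k q_k(t) v^T mat^k with q_k(t) = exp(-nu t) (nu t)^k / k!."""
    weight = math.exp(-nu * t)
    out = [weight * x for x in v]
    for k in range(1, k_max + 1):
        v = vec_mat(v, mat)
        weight *= nu * t / k
        out = [o + weight * x for o, x in zip(out, v)]
    return out

def next_failure_survival(a0, rates, good, nu, t, x):
    """Return (P[T_{aB|t} > x], mass of T_{aB|t} at the origin)."""
    a_nu = uniformize(rates, nu)
    p_t = poisson_mix(a0, a_nu, nu, t)                    # p^T(t|a) via (3.8)
    p_good = [p_t[m] for m in good]                       # restriction to G
    a_gg = [[a_nu[m][n] for n in good] for m in good]     # a_{nu:GG}
    survival = sum(poisson_mix(p_good, a_gg, nu, x))      # (3.16) started from p_G(t|a)
    return survival, 1.0 - sum(p_good)

# Hypothetical chain: good state 0 fails at rate 2, bad state 1 is repaired at rate 3.
RATES = [[0.0, 2.0], [3.0, 0.0]]
surv, mass = next_failure_survival([1.0, 0.0], RATES, good=[0], nu=4.0, t=0.7, x=0.5)
```

For this chain the survival probability factorizes as $p_{00}(t)\,e^{-2x}$, and the mass at the origin is $1 - p_{00}(t)$.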

3.5  Stationary  reliability  measures 

We have seen that the random variable $T_{aB|t}$ converges in distribution to $S$ as $t \to \infty$ when the process $J(t)$ is ergodic. It can be seen from (3.20) that $S \stackrel{d}{=} T_{eB}$. Hence the stationary reliability measures described in 2.(A5) can be computed using (3.16) through (3.18) where $\mathbf{a}^T$ should be replaced by $\mathbf{e}^T$. The conditional measures can be calculated accordingly.

3.6  Quasi-stationary  reliability  measures 

For the quasi-stationary exit time $Q$ introduced in 2.(A6), we assumed that the good set $G$ is irreducible, i.e. all states in $G$ can communicate with each other within $G$. Under this condition, it is known that $Q$ is exponentially distributed, see Keilson (1974, 1979). More specifically one has

(3.21)  $F_Q(x) = 1 - e^{-\nu(1 - \lambda_Q)x}$

where $\lambda_Q$ is the maximum eigenvalue of the matrix $\mathbf{a}_{\nu:GG}$. $\lambda_Q$ may be found either by solving a set of equations or by the power method. Then the quasi-stationary reliability measures can be computed straightforwardly.
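
The power method mentioned above can be sketched as follows. The sketch iterates a row vector against $\mathbf{a}_{\nu:GG}$ and renormalizes it to sum to one, so that the normalizing constant converges to $\lambda_Q$ and the vector to the associated left eigenvector. The exponential rate $\nu(1 - \lambda_Q)$ used in `quasi_stationary_cdf` is an assumption consistent with the eigenvalue relation; the $2 \times 2$ matrix is hypothetical.

```python
import math

def vec_mat(v, mat):
    return [sum(v[i] * mat[i][j] for i in range(len(v))) for j in range(len(mat[0]))]

def power_method(a_gg, iterations=500):
    """Dominant eigenvalue and normalized left eigenvector of a substochastic matrix."""
    size = len(a_gg)
    q = [1.0 / size] * size
    lam = 0.0
    for _ in range(iterations):
        q = vec_mat(q, a_gg)       # q^T a_{nu:GG}
        lam = sum(q)               # growth factor, since the previous q summed to one
        q = [x / lam for x in q]
    return lam, q

def quasi_stationary_cdf(x, lam_q, nu):
    """F_Q(x), assuming the exponential rate nu * (1 - lambda_Q)."""
    return 1.0 - math.exp(-nu * (1.0 - lam_q) * x)

# Hypothetical restricted matrix with eigenvalues 0.8 and 0.3.
A_GG = [[0.5, 0.3], [0.2, 0.6]]
lam_q, q_vec = power_method(A_GG)
```

Here the iteration converges to $\lambda_Q = 0.8$ with left eigenvector $(0.4, 0.6)$; convergence is geometric at the ratio of the two largest eigenvalue moduli.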

The cumulative operational time $CO(t|\mathbf{a})$ described in 2.(A7) can be viewed as a special case of the cumulative computational capacity of the system, $V(t|\mathbf{a})$, discussed in 2.(B1), where $\phi(m) = 1$, $m \in G$. Hence numerical procedures for finding the moments of $CO(t|\mathbf{a})$ can be found from those for finding the moments of $V(t|\mathbf{a})$, which we discuss next.

§4  Numerical Procedures for Computing Performance Measures in Category (B)

The results of Çinlar (1969), derived from renewal type arguments, enable us to develop numerical procedures for evaluating the computational capacity of the system described in Section 2(B). In this section, however, we derive the results through an independent analytic approach. An extension of the proof to the semi-Markov case is straightforward and is omitted here. The moments of the performability measures are then placed in a form for which the uniformization technique can be readily applied.

4.1  Cumulative computational capacity in the time interval [0,t)

In order to evaluate the cumulative computational capacity of the system in the interval $[0,t)$, we consider the process $Z(t)$ defined by

(4.1)  $\frac{d}{dt}Z(t) = \gamma_m \ \text{ if } J(t) = m, \quad m \in S$

where $\gamma_m \ge 0$. The process $Z(t)$ increases at the rate of $\gamma_m$ while the underlying process $J(t)$ is in state $m$. We note that if the initial state probability vector of $J(t)$ is $\mathbf{a}$, then one has

(4.2)  $Z(t) = CO(t|\mathbf{a})$  if $\gamma_m = 1$, $m \in G$, and $\gamma_m = 0$, otherwise,

(4.3)  $Z(t) = V(t|\mathbf{a})$  if $\gamma_m = \phi(m)$, $m \in S$,

(4.4)  $Z(t) = V_j(t|\mathbf{a})$  if $\gamma_m = \phi_j(m)$, $m \in S$.

Let

(4.5)  $F_m(x,t) = P[Z(t) \le x, \ J(t) = m], \quad m \in S.$

We assume that the initial distributions are given by

(4.6)  $F_m(x,0) = a_m D_m(x)$

where the $D_m$ are absolutely continuous cumulative distribution functions having probability density functions $d_m(x)$. It can be easily seen that the $F_m(x,t)$ are then also absolutely continuous and one can define

(4.7)  $f_m(x,t) = \frac{\partial}{\partial x}F_m(x,t), \quad m \in S.$

We note that $f_m(x,t) = 0$ if $x < 0$ or $t < 0$. The p.d.f. of $Z(t)$ denoted by $f(x,t)$ is then given by

(4.8)  $f(x,t) = \sum_{m \in S} f_m(x,t).$

For the event $\{J(t) = m, Z(t) = x\}$ to occur, either $J(t)$ starts at $m$ and remains there or $J(t)$ starts somewhere, enters $m$ at time $y$, $0 < y < t$, and stays in $m$ for the period $(y,t)$. By examining the probabilistic flow of the bivariate process $(J(t), Z(t))$ in this way, one finds that

(4.9)  $f_m(x,t) = f_m(x - \gamma_m t, 0)\,e^{-\nu_m t} + \sum_{i \in S} \nu_{im} \int_0^{\infty} f_i(x - \gamma_m y, t - y)\,e^{-\nu_m y}\,dy, \quad m \in S.$

For convenience, we apply the uniformization technique of Keilson (1974, 1979) here. Using the uniformization constant $\nu$ and the associated stochastic matrix $\mathbf{a}_\nu$ of (3.5), Equation (4.9) can be written as

(4.10)  $f_m(x,t) = f_m(x - \gamma_m t, 0)\,e^{-\nu t} + \nu \sum_{i \in S} a_{\nu:im} \int_0^{\infty} f_i(x - \gamma_m y, t - y)\,e^{-\nu y}\,dy, \quad m \in S.$

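Let $\varphi_m(w,t) = \int_0^{\infty} e^{-wx} f_m(x,t)\,dx$ and $\hat{\varphi}_m(w,s) = \int_0^{\infty}\int_0^{\infty} e^{-wx-st} f_m(x,t)\,dx\,dt$. By taking the double transform of (4.10) one obtains

(4.11)  $\hat{\varphi}_m(w,s) = \frac{1}{s + \nu + \gamma_m w}\left\{\varphi_m(w,0) + \nu \sum_{i \in S} a_{\nu:im}\,\hat{\varphi}_i(w,s)\right\}, \quad m \in S.$

Equation (4.11) can be expressed more succinctly in matrix notation. We define the transform vector by $\hat{\boldsymbol{\varphi}}^T(w,s) = [\hat{\varphi}_m(w,s)]_{m \in S}$. The vector $\boldsymbol{\varphi}^T(w,0)$ is defined similarly. Let $\boldsymbol{\gamma}_D$ be the diagonal matrix having diagonal elements $\gamma_m$ ordered appropriately and define $\mathbf{K}(w,s) = (s + \nu)\mathbf{I} + w\boldsymbol{\gamma}_D$. One then sees from (4.11) that

(4.12)  $\hat{\boldsymbol{\varphi}}^T(w,s)\,\mathbf{K}(w,s) = \boldsymbol{\varphi}^T(w,0) + \nu\,\hat{\boldsymbol{\varphi}}^T(w,s)\,\mathbf{a}_\nu.$

It can be easily seen that $(\mathbf{K}(w,s) - \nu\mathbf{a}_\nu)^{-1}$ exists for $\mathrm{Re}(w) \ge 0$, $\mathrm{Re}(s) > 0$. Furthermore, one has $\boldsymbol{\pi}(s) = \int_0^{\infty} e^{-st}\mathbf{p}(t)\,dt = [(s + \nu)\mathbf{I} - \nu\mathbf{a}_\nu]^{-1} = [\mathbf{K}(w,s) - \nu\mathbf{a}_\nu - w\boldsymbol{\gamma}_D]^{-1}$. Hence from (4.12) we obtain

(4.13)  $\hat{\boldsymbol{\varphi}}^T(w,s) = \boldsymbol{\varphi}^T(w,0)\,[\mathbf{I} + w\boldsymbol{\pi}(s)\boldsymbol{\gamma}_D]^{-1}\,\boldsymbol{\pi}(s).$

Let $\psi(w,s) = \int_0^{\infty} e^{-st}E[e^{-wZ(t)}]\,dt = \int_0^{\infty}\int_0^{\infty} e^{-wx-st} f(x,t)\,dx\,dt$. Then $\psi(w,s) = \hat{\boldsymbol{\varphi}}^T(w,s)\,\mathbf{1}$ from (4.8). Since $\boldsymbol{\pi}(s)\mathbf{1} = \frac{1}{s}\mathbf{1}$, this then leads to

(4.14)  $\psi(w,s) = \frac{1}{s}\,\boldsymbol{\varphi}^T(w,0)\,[\mathbf{I} + w\boldsymbol{\pi}(s)\boldsymbol{\gamma}_D]^{-1}\,\mathbf{1}.$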

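If $Z(0+) = 0$, we choose a sequence $\{D_{mj}(x)\}_{j=0}^{\infty}$ so that $D_{mj}(x) \to U(x)$ as $j \to \infty$ where $U(x) = 1$, $x \ge 0$ and $U(x) = 0$, $x < 0$. Correspondingly $\boldsymbol{\varphi}_j(w,0) \to \mathbf{a}$ as $j \to \infty$. One then has

(4.15)  $\psi(w,s) = \frac{1}{s}\left\{1 + \sum_{k=1}^{\infty} (-1)^k w^k\,\mathbf{a}^T(\boldsymbol{\pi}(s)\boldsymbol{\gamma}_D)^k\,\mathbf{1}\right\}.$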

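Although the double inversion of (4.15) is quite awkward, it does provide the moment formula. One easily sees that

(4.16)  $\int_0^{\infty} e^{-st}E[Z^k(t)]\,dt = \frac{k!}{s}\,\mathbf{a}^T[\boldsymbol{\pi}(s)\boldsymbol{\gamma}_D]^k\,\mathbf{1}, \quad k = 1, 2, \ldots$

or equivalently

(4.17)  $E[Z^k(t)] = k! \int_0^t \mathbf{a}^T\{\mathbf{p}(y)\boldsymbol{\gamma}_D\}^{(k)}\,\mathbf{1}\,dy, \quad k = 1, 2, \ldots$

where $\{\mathbf{p}(t)\boldsymbol{\gamma}_D\}^{(k+1)} = \int_0^t \mathbf{p}(t - y)\boldsymbol{\gamma}_D\{\mathbf{p}(y)\boldsymbol{\gamma}_D\}^{(k)}\,dy$ with $\{\mathbf{p}(t)\boldsymbol{\gamma}_D\}^{(1)} = \mathbf{p}(t)\boldsymbol{\gamma}_D$. Using the uniformization procedure described in 3.1, we have:

(4.18)  $E[Z(t)] = \sum_{n=0}^{\infty} Q_n(0,t)\,\mathbf{a}^T\mathbf{a}_\nu^n\boldsymbol{\gamma}_D\mathbf{1}$

(4.19)  $E[Z^2(t)] = \frac{2}{\nu}\sum_{n=0}^{\infty}\sum_{m=0}^{\infty} Q_{n+m+1}(0,t)\,\mathbf{a}^T\mathbf{a}_\nu^n\boldsymbol{\gamma}_D\mathbf{a}_\nu^m\boldsymbol{\gamma}_D\mathbf{1}.$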
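
A sketch of how the first two moments might be evaluated by uniformization is given below. It assumes $Q_j(0,t) = \int_0^t q_j(x)\,dx$ and uses the convolution identity $\int_0^y q_n(y-u)\,q_m(u)\,du = q_{n+m+1}(y)/\nu$, so that the second moment takes the form $\frac{2}{\nu}\sum_n\sum_m Q_{n+m+1}(0,t)\,\mathbf{a}^T\mathbf{a}_\nu^n\boldsymbol{\gamma}_D\mathbf{a}_\nu^m\boldsymbol{\gamma}_D\mathbf{1}$. The truncation `k_max`, the function names and the two-state example are hypothetical.

```python
import math

def uniformize(rates, nu):
    size = len(rates)
    a_nu = [[rates[m][n] / nu for n in range(size)] for m in range(size)]
    for m in range(size):
        a_nu[m][m] += 1.0 - sum(rates[m]) / nu
    return a_nu

def cumulative_moments(a0, rates, gamma, nu, t, k_max=60):
    """Return (E[Z(t)], E[Z(t)^2]) from truncated uniformization series."""
    a_nu = uniformize(rates, nu)
    size = len(a0)
    # Q_j(0, t) for j = 0 .. 2*k_max + 1, using Q_j = Q_{j-1} - q_j(t) / nu.
    q = math.exp(-nu * t)
    acc = (1.0 - q) / nu
    big_q = [acc]
    for j in range(1, 2 * k_max + 2):
        q *= nu * t / j
        acc -= q / nu
        big_q.append(acc)
    # u[n] = a^T a_nu^n (row vectors); g[m] = a_nu^m gamma_D 1 (column vectors).
    u, g = [list(a0)], [list(gamma)]
    for _ in range(k_max):
        prev_u, prev_g = u[-1], g[-1]
        u.append([sum(prev_u[i] * a_nu[i][j] for i in range(size)) for j in range(size)])
        g.append([sum(a_nu[i][j] * prev_g[j] for j in range(size)) for i in range(size)])
    first = sum(big_q[n] * sum(u[n][i] * gamma[i] for i in range(size))
                for n in range(k_max + 1))
    second = 0.0
    for n in range(k_max + 1):
        for m in range(k_max + 1):
            inner = sum(u[n][i] * gamma[i] * g[m][i] for i in range(size))
            second += big_q[n + m + 1] * inner
    return first, 2.0 * second / nu

# Hypothetical two-state chain with rates 2 and 3.
RATES = [[0.0, 2.0], [3.0, 0.0]]
m1_all, m2_all = cumulative_moments([1.0, 0.0], RATES, [1.0, 1.0], nu=4.0, t=1.5)
m1_up, _ = cumulative_moments([1.0, 0.0], RATES, [1.0, 0.0], nu=4.0, t=1.5)
```

Two checks are available for this example. With $\gamma_m = 1$ for all $m$ the process is deterministic, $Z(t) = t$, so the moments must be $t$ and $t^2$. With $\gamma = (1, 0)$ the first moment equals $\int_0^t p_{00}(x)\,dx$.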


We note that $\lim_{t \to \infty} E[Z^k(t)]/t^k = (\mathbf{e}^T\boldsymbol{\gamma}_D\mathbf{1})^k$. Keilson and Rao (1970, 1971) have shown that $\{Z(t) - E[Z(t)]\}/\sqrt{\mathrm{Var}\,Z(t)}$ converges in distribution to $N(0,1)$ as $t \to \infty$.
§5  Numerical Procedures for Computing Performance Measures in Category (C)

Given a task requiring a certain computational time, it is of interest to find the probability that the task can be processed without interruption due to system failure. In order to answer this question, the distribution of the cumulative computational capacity before the first machine failure is needed. This random variable can be denoted by $Z(T_{aB})$ where $Z(t)$ is defined in (4.1). In this section, we develop numerical procedures for calculating performance measures involving $Z(T_{aB})$ by employing the trick used in Beaudry (1978) in a more systematic manner.

5.1  Computational  capacity  before  the  first  system  failure  and  computational 
reliability 

The distribution and the moments of $Z(T_{aB})$ can be obtained from the results in 3.3 by modifying the transition rate matrix $\boldsymbol{\nu}$ in the following manner. If the process $J(t)$ enters state $m$, it stays there for an exponentially distributed period with parameter $\nu_m$. Upon the expiration of the period, the process changes its state to $n$ with probability $\nu_{mn}/\nu_m$, $n \in S$. During this dwell time in state $m$, the process $Z(t)$ increases by an exponentially distributed amount with parameter $\nu_m/\gamma_m$, where this increment is understood to be zero if $\gamma_m = 0$. Suppose we consider an alternative process $\tilde{J}(t)$ on $S$ such that when it enters state $m$, it stays there for an exponentially distributed time with parameter $\nu_m/\gamma_m$. As before, this dwell time is zero if $\gamma_m = 0$. At the termination of this period, it moves to the state $n$ with probability $\nu_{mn}/\nu_m$. It is then clear that the sequences of the states visited by $\tilde{J}(t)$ and $J(t)$ are probabilistically the same provided that the two processes share the same initial probability vector $\mathbf{a}$. However the time required for $\tilde{J}(t)$ to achieve such transitions gives the accumulated computational capacity during those transitions.

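If $\gamma_m > 0$ for all $m \in G$ (c.f. cases in (4.2) and (4.3)), one sees that $\tilde{J}(t)$ is a Markov chain on $S$ governed by

(5.1)  $\tilde{\boldsymbol{\nu}} = \begin{pmatrix} \boldsymbol{\gamma}_{D:GG}^{-1}\boldsymbol{\nu}_{GG} & \boldsymbol{\gamma}_{D:GG}^{-1}\boldsymbol{\nu}_{GB} \\ \boldsymbol{\nu}_{BG} & \boldsymbol{\nu}_{BB} \end{pmatrix}.$

The diagonal matrix $\tilde{\boldsymbol{\nu}}_D$ is defined accordingly. It can then be readily seen that

(5.2)  $\tilde{T}_{aB} \stackrel{d}{=} Z(T_{aB})$

where

(5.3)  $\tilde{T}_{aB} = \inf\{t : \tilde{J}(t) \in B \mid \tilde{J}(0) = m \text{ with probability } a_m, \ m \in S\}.$

Hence the distribution and the moments of $Z(T_{aB})$ can be obtained from (3.14) through (3.19) where $\boldsymbol{\nu}_{D:GG}$ and $\boldsymbol{\nu}_{GG}$ should be replaced by $\tilde{\boldsymbol{\nu}}_{D:GG}$ and $\tilde{\boldsymbol{\nu}}_{GG}$. Accordingly

(5.4)  $\tilde{\mathbf{a}}_{\nu:GG} = \mathbf{I}_{GG} - \frac{1}{\nu}\tilde{\boldsymbol{\nu}}_{D:GG} + \frac{1}{\nu}\tilde{\boldsymbol{\nu}}_{GG}$

and

(5.5)  $P[\tilde{T}_{aB} > x] = \sum_{k=0}^{\infty} q_k(x)\,\mathbf{a}_G^T\,\tilde{\mathbf{a}}_{\nu:GG}^k\,\mathbf{1}_G.$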

For the stationary computational measures described in Section 2(C3), the initial distribution $\mathbf{a}^T$ in (5.5) should be replaced by the ergodic vector $\mathbf{e}^T$. For the quasi-stationary computational measures of Section 2(C4), one has to calculate the quasi-stationary vector $\mathbf{q}_G^T$ on $G$ first. The vector $\mathbf{q}_G^T$ can be found, for example, using the matrix $\mathbf{a}_{\nu:GG}$ and the power method since $\mathbf{q}_G^T\mathbf{a}_{\nu:GG} = \lambda_Q\mathbf{q}_G^T$. The vector $\mathbf{a}_G^T$ in (5.5) should be replaced by $\mathbf{q}_G^T$.
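
The rate-scaling idea behind (5.1) through (5.5) can be illustrated as follows: every row of the rate matrix belonging to a good state $m$ is divided by $\gamma_m$, after which the survival series of 3.3 is applied unchanged. The sketch assumes $\gamma_m > 0$ on $G$ and a uniformization constant `nu` at least as large as every scaled exit rate $\nu_m/\gamma_m$; the example system and all names are hypothetical.

```python
import math

def vec_mat(v, mat):
    return [sum(v[i] * mat[i][j] for i in range(len(v))) for j in range(len(mat[0]))]

def capacity_survival(a0, rates, gamma, good, nu, x, k_max=200):
    """P[Z(T_aB) > x]: survival series on G after dividing good rows by gamma_m."""
    a_gg = []
    for m in good:
        scaled = [rates[m][n] / gamma[m] for n in range(len(rates))]  # row m of the scaled rates
        row = [scaled[n] / nu for n in good]
        row[good.index(m)] += 1.0 - sum(scaled) / nu                  # requires nu >= nu_m / gamma_m
        a_gg.append(row)
    v = [a0[m] for m in good]
    weight = math.exp(-nu * x)
    surv = weight * sum(v)
    for k in range(1, k_max + 1):
        v = vec_mat(v, a_gg)
        weight *= nu * x / k
        surv += weight * sum(v)
    return surv

# Hypothetical tandem system 0 -> 1 -> 2 at unit rates, B = {2}, capacity rates 2 and 0.5.
RATES = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
GAMMA = [2.0, 0.5, 0.0]
s = capacity_survival([1.0, 0.0, 0.0], RATES, GAMMA, good=[0, 1], nu=2.5, x=1.0)
```

In this example $Z(T_{aB})$ is the sum of two independent exponential variables with rates $0.5$ and $2$, so its survival function is $(2e^{-0.5x} - 0.5e^{-2x})/1.5$.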


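When $\gamma_m = 0$ for some $m \in G$ (such as the case (4.4)), the state $m$ becomes an instantaneous state for $\tilde{J}(t)$. That is, the process $\tilde{J}(t)$ moves to state $n$ with probability $\nu_{mn}/\nu_m$ as soon as it enters $m$. To eliminate such instantaneous states inside $G$, we further modify $\tilde{J}(t)$. Let $\mathbf{b} = [b_{mn}]$ where $b_{mn} = \nu_{mn}/\nu_m$. The good set is decomposed into subsets $H$ and $L$ such that

(5.6)  $H = \{m : \gamma_m > 0, \ m \in G\}, \quad L = \{m : \gamma_m = 0, \ m \in G\},$

and $\mathbf{b}$ is partitioned into the corresponding submatrices

(5.7)  $\mathbf{b} = \begin{pmatrix} \mathbf{b}_{HH} & \mathbf{b}_{HL} & \mathbf{b}_{HB} \\ \mathbf{b}_{LH} & \mathbf{b}_{LL} & \mathbf{b}_{LB} \\ \mathbf{b}_{BH} & \mathbf{b}_{BL} & \mathbf{b}_{BB} \end{pmatrix}.$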

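The transition probability matrix of the corresponding replacement process $J^\dagger(t)$ on $H \cup B$, eliminating the instantaneous states in $G$, is given by

(5.8)  $\mathbf{b}^\dagger = \begin{pmatrix} \mathbf{b}^\dagger_{HH} & \mathbf{b}^\dagger_{HB} \\ \mathbf{b}^\dagger_{BH} & \mathbf{b}^\dagger_{BB} \end{pmatrix}$

where

(5.9)  $\mathbf{b}^\dagger_{VW} = \mathbf{b}_{VW} + \mathbf{b}_{VL}(\mathbf{I}_{LL} - \mathbf{b}_{LL})^{-1}\mathbf{b}_{LW}, \quad \text{for } V, W \in \{H, B\}.$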

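Then $J^\dagger(t)$ is a Markov chain on $H \cup B$ governed by

(5.10)  $\boldsymbol{\nu}^\dagger = [\nu^\dagger_{mn}]; \quad \nu^\dagger_{mn} = \frac{\nu_m}{\gamma_m}b^\dagger_{mn} \text{ for } m \in H, \quad \nu^\dagger_{mn} = \nu_m b^\dagger_{mn} \text{ for } m \in B; \quad n \in H \cup B.$

We define $\boldsymbol{\nu}^\dagger_D$ as before and the matrix $\mathbf{a}^\dagger_{\nu:HH}$ by

(5.11)  $\mathbf{a}^\dagger_{\nu:HH} = \mathbf{I}_{HH} - \frac{1}{\nu}\boldsymbol{\nu}^\dagger_{D:HH} + \frac{1}{\nu}\boldsymbol{\nu}^\dagger_{HH}.$

The initial state probability vector $\mathbf{a}$ must also be modified. We define

(5.12)  $\mathbf{a}^{\dagger T}_H = \mathbf{a}_H^T + \mathbf{a}_L^T(\mathbf{I}_{LL} - \mathbf{b}_{LL})^{-1}\mathbf{b}_{LH}.$

In the case of stationary or quasi-stationary measures, the replacement is similar using $\mathbf{e}^T$ and $(\mathbf{q}_G^T, \mathbf{0}_B^T)$ instead of $\mathbf{a}^T$. The results in (3.16) through (3.19) then provide the computational procedures needed, where $\mathbf{a}^\dagger_{\nu:HH}$ and $\mathbf{a}^\dagger_H$ of (5.11) and (5.12) should replace $\mathbf{a}_{\nu:GG}$ and $\mathbf{a}_G$. Namely one has

(5.13)  $T^\dagger_{aB} \stackrel{d}{=} Z(T_{aB})$

where

(5.14)  $T^\dagger_{aB} = \inf\{t : J^\dagger(t) \in B \mid J(0) = m \text{ with probability } a_m, \ m \in S\},$

and

(5.15)  $P[T^\dagger_{aB} > x] = \sum_{k=0}^{\infty} q_k(x)\,\mathbf{a}^{\dagger T}_H\,(\mathbf{a}^\dagger_{\nu:HH})^k\,\mathbf{1}_H.$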
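
The elimination of instantaneous states amounts to the block formula $\mathbf{b}_{VW} + \mathbf{b}_{VL}(\mathbf{I}_{LL} - \mathbf{b}_{LL})^{-1}\mathbf{b}_{LW}$, which the following sketch evaluates for a small jump matrix. The Gauss-Jordan inverse, the helper names and the three-state example (one state each in $H$, $L$ and $B$) are assumptions made for illustration only.

```python
def inverse(mat):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(mat)
    aug = [[float(x) for x in mat[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for c in range(n):
        pivot_row = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot_row] = aug[pivot_row], aug[c]
        pivot = aug[c][c]
        aug[c] = [x / pivot for x in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def mat_mul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]

def block(b, rows, cols):
    return [[b[i][j] for j in cols] for i in rows]

def eliminate(b, keep, inst):
    """Jump matrix on `keep` (H and B) after removing instantaneous states `inst` (L)."""
    b_ll = block(b, inst, inst)
    size = len(inst)
    i_minus = [[(1.0 if i == j else 0.0) - b_ll[i][j] for j in range(size)]
               for i in range(size)]
    fundamental = inverse(i_minus)                        # (I_LL - b_LL)^{-1}
    detour = mat_mul(mat_mul(block(b, keep, inst), fundamental), block(b, inst, keep))
    direct = block(b, keep, keep)
    return [[direct[i][j] + detour[i][j] for j in range(len(keep))]
            for i in range(len(keep))]

# Hypothetical jump matrix: state 0 in H, state 1 in L (instantaneous), state 2 in B.
B_MAT = [[0.0, 0.7, 0.3],
         [0.4, 0.0, 0.6],
         [1.0, 0.0, 0.0]]
b_dagger = eliminate(B_MAT, keep=[0, 2], inst=[1])
```

In this example a visit to the instantaneous state returns to state 0 with probability $0.7 \times 0.4 = 0.28$, and the rows of the reduced matrix remain stochastic.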
§6  A Numerical Example

In this section, we demonstrate the computational procedures described in the previous sections through a numerical example. All figures are given at the end of this section. The system we consider here consists of two processing units. Each unit fails through two phases. We define

(6.1)  $S_i = \{0, 1, 2\}, \quad 1 \le i \le 2,$

where the state 2 denotes that the $i$-th unit is in the first phase of its up-time. The state 1 represents the operative state in the second phase and the state 0 means the failed state under repair. We assume that all relevant distributions are exponential with parameters $(\mu_i, \lambda_{i1}, \lambda_{i2})$ corresponding to the states $(0, 1, 2)$ of (6.1).

The two processing units interact with each other in the following manner. If the $i$-th unit is down and the other unit is in operative state $k$, then the whole system fails with probability $p_{ik}$ where $i, k = 1, 2$. When the system is down while one of the two units is operative, the state of the operative unit does not change until the failed unit is repaired. Let $I(t)$ be the indicator function where $I(t) = 1$ ($I(t) = 0$) means that the system is functioning (down) at time $t$. The state space $S \subset S_1 \times S_2 \times \{0, 1\}$ then has the following 13 states:

(6.2)

linearized state number    state
 0                         (0,0,0)
 1                         (1,0,0)
 2                         (1,0,1)
 3                         (0,1,1)
 4                         (0,1,0)
 5                         (2,0,0)
 6                         (2,0,1)
 7                         (1,1,1)
 8                         (0,2,1)
 9                         (0,2,0)
10                         (2,1,1)
11                         (1,2,1)
12                         (2,2,1)


We note that the failed states of the system due to the failure in coverage (e.g. (1,0,0), (0,1,0), (2,0,0), (0,2,0)) and the failed state of the system due to failure of all modules, (0,0,0), are clearly distinguished here.

If $X_i(t)$ denotes the state of the $i$-th unit at time $t$, then the process $J(t) = (X_1(t), X_2(t), I(t))$ is a Markov process governed by the transition rate matrix:

(6.3)  $\boldsymbol{\nu} =$

We assume that the total job class $C$ is decomposed into three disjoint subclasses $C_j$, $j = 1, 2, 3$. The processing capacities of the whole system and each class depending on the state of $J(t)$ are assumed to be given below.

state number m    0    1    2    3    4    5    6    7    8    9    10   11   12
phi(m)            0    0    1    2    0    0    3    3    4    0    7    8    20
phi_1(m)          0    0    0.5  1    0    0    1    1    2    0    4    3    10
phi_2(m)          0    0    0    1    0    0    0    1    2    0    1    3    5
phi_3(m)          0    0    0.5  0    0    0    2    1    0    0    2    2    5
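
The tabulated capacities can be held in plain lists indexed by the linearized state number. The sketch below is only a data-handling illustration: it checks the observation that the three class capacities add up to the total capacity in every state, and it lists, for each class, the operative states with zero capacity rate, which are the states that would form the instantaneous set $L$ of Section 5. The list of operative states is read off from the third component of the states in (6.2); all variable and function names are hypothetical.

```python
# Capacity rates indexed by linearized state number 0..12, copied from the table.
PHI   = [0, 0, 1,   2, 0, 0, 3, 3, 4, 0, 7, 8, 20]
PHI_1 = [0, 0, 0.5, 1, 0, 0, 1, 1, 2, 0, 4, 3, 10]
PHI_2 = [0, 0, 0,   1, 0, 0, 0, 1, 2, 0, 1, 3, 5]
PHI_3 = [0, 0, 0.5, 0, 0, 0, 2, 1, 0, 0, 2, 2, 5]

# Operative states: those whose indicator component equals 1 in (6.2).
GOOD = [2, 3, 6, 7, 8, 10, 11, 12]

def class_totals(*classes):
    """State-by-state sum of the class capacity rates."""
    return [sum(values) for values in zip(*classes)]

def zero_rate_states(rates, candidates):
    """Candidate states whose capacity rate is zero."""
    return [m for m in candidates if rates[m] == 0]

totals = class_totals(PHI_1, PHI_2, PHI_3)
instantaneous = {name: zero_rate_states(rates, GOOD)
                 for name, rates in (("class 1", PHI_1), ("class 2", PHI_2), ("class 3", PHI_3))}
```

For the total capacity and for class 1 no operative state has zero rate, whereas classes 2 and 3 each have two such states, so the elimination step of Section 5 is needed only for the latter.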

The  values  of  the  parameters  employed  in  the  numerical  example  are  summarized  below. 


(6.4)  $(\mu_1, \lambda_{11}, \lambda_{12}) = (1.0, 0.4, 0.6)$
       $(\mu_2, \lambda_{21}, \lambda_{22}) = (0.7, 0.2, 0.8)$

(6.5)  $\mathbf{p} = [p_{ik}] = \begin{pmatrix} 0.2 & 0.3 \\ 0.5 & 0.7 \end{pmatrix}.$

It is assumed that the system starts fresh at time $t = 0$ so that $J(0) = (2,2,1)$. Correspondingly the initial state probability vector is given by $\mathbf{a}^T = (0, 0, \ldots, 0, 1)$.
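
For computation, the state space of (6.2) and the initial vector can be encoded directly. The following sketch assumes the linearized numbering of (6.2) and derives the good and bad sets from the indicator component; the variable names are hypothetical.

```python
# States (X_1, X_2, I) in the linearized order of (6.2).
STATES = [
    (0, 0, 0), (1, 0, 0), (1, 0, 1), (0, 1, 1), (0, 1, 0),
    (2, 0, 0), (2, 0, 1), (1, 1, 1), (0, 2, 1), (0, 2, 0),
    (2, 1, 1), (1, 2, 1), (2, 2, 1),
]
INDEX = {state: number for number, state in enumerate(STATES)}

# Good set G: system functioning (I = 1); bad set B: system down (I = 0).
GOOD = [number for number, (_, _, up) in enumerate(STATES) if up == 1]
BAD = [number for number, (_, _, up) in enumerate(STATES) if up == 0]

# Initial vector a: all mass on the fresh state (2, 2, 1).
A0 = [0.0] * len(STATES)
A0[INDEX[(2, 2, 1)]] = 1.0
```

With this encoding the good set contains eight states and the bad set five, and the subvectors and submatrices used in Sections 3 and 5 can be extracted by indexing with `GOOD` and `BAD`.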
Figure 6.4    Figure 6.5    Figure 6.6    Figure 6.7(a)    Figure 6.7(b)


Figures 6.1(a) and 6.1(b) depict the time dependent state probabilities $p_k(t)$ for $0 \le k \le 6$ and $7 \le k \le 12$ respectively. We observe that the perfect state probability $p_{12}(t)$ decreases monotonically to its ergodic probability $e_{12} = 0.29716$, while all other state probabilities rise from zero to the corresponding ergodic probabilities as $t \to \infty$, some of them monotone and some of them unimodal. The ergodicity sets in for $t > 4$. In Figures 6.2(a) and (b), the point availability $A(t|\mathbf{a})$ and the interval availability $AI(t,\tau|\mathbf{a})$ with $\tau = 1.5$ are plotted. Both decrease monotonically to the common ergodic value $A_\infty = AI_\infty = 0.88757$ as $t \to \infty$.

Figures 6.3(a) and (b) illustrate the cumulative distributions $F_{aB}(x)$ and $F_{aB|t}(x)$ of the first passage times $T_{aB}$ and $T_{aB|t}$ with $t = 1.5$ respectively. We note that $T_{aB|t}$ has a mass of $F_{aB|t}(0+) = 1 - A(1.5)$ at the origin. Figure 6.4 depicts the cumulative distribution functions $F_S(t)$ and $F_Q(t)$ for the stationary and quasi-stationary random variables $S$ and $Q$ respectively. We observe that $S$ has a mass of $F_S(0+) = 1 - A_\infty$ at the origin. The values of $F_S(t)$ and $F_Q(t)$ become close for $t > 10$. We note that these distribution curves enable one to derive the corresponding $\alpha$-mission times easily. For example, the $\alpha$-mission time with $\alpha = 0.8$ is 1.56.

In Figure 6.5, the mean cumulative operational time of the whole system $E[CO(t|\mathbf{a})]$ is plotted. We see that $E[CO(t|\mathbf{a})]$ is almost linear with slope $A_\infty = 0.88757$. In Figure 6.6, the mean computational capacities for each job class and the entire job class are given. It is observed that all curves are concave-shaped and become quite linear for $t > 3$. Figure 6.7(a) depicts the distributions of computational capacities before the first system failure for each job class and the entire job class. Figures 6.7(b) and (c) provide the corresponding curves at stationarity and quasi-stationarity. These distribution curves again enable one to evaluate the corresponding $\alpha$-mission times.
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